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Abstract 

In the following article we consider approximate Bayesian parameter inference for observation driven time se- 
ries models. Such statistical models appear in a wide variety of applications, including econometrics and applied 
mathematics. This article considers the scenario where the likelihood function cannot be evaluated point-wise; 
in such cases, one cannot perform exact statistical inference, including parameter estimation, which often re- 
quires advanced computational algorithms, such as Markov chain Monte Carlo (MCMC). We introduce a new 
approximation based upon approximate Bayesian computation (ABC). Under some conditions, we show that as 
n — > ao, with n the length of the time series, the ABC posterior has, almost surely, a maximum a posteriori 
(MAP) estimator of the parameters which is different from the true parameter. However, a noisy ABC MAP, 
which perturbs the original data, asymptotically converges to the true parameter, almost surely. In order to 
draw statistical inference, for the ABC approximation adopted, standard MCMC algorithms can have acceptance 
probabilities that fall at an exponential rate in n and slightly more advanced algorithms can mix poorly. We 
develop a new and improved MCMC kernel, which is based upon an exact approximation of a marginal algo- 
rithm, whose cost per-iteration is random but the expected cost, for good performance, is shown to be 0(n 2 ) 
per-iteration. We implement our new MCMC kernel for parameter inference from models in econometrics. 

Key Words: Observation Driven Time Series Models, Approximate Bayesian Computation, Asymptotic Con- 
sistency, Markov Chain Monte Carlo. 



1 Introduction 

Observation driven time-series models, introduced by [4], has a wide variety of real applications, including economet- 
rics (GARCH models) and applied mathematics (inferring initial conditions and parameters of ordinary differential 
equations). The model can be described as follows. We observe {Yfc}fc e N i ^fc G Y which are associated to a dynamic 
system {Xk}k£N , X k £ X which is potentially unknown. Define the process Xk} n >o (with y$ some arbitrary 
point on Y) on a probability space (0, J^Pg), where, for every 9 € O C M. de , Pg is a probability measure. Denote 
by &k — cr({Y n , A„} <n<fc)- The model is defined as, for k £ N 

P(Y fe+1 G A\& k ) = f H e (x k ,dy) AxXe^ 

J A 

X k+ i = <5> e (Xk,Y k+1 ) 
¥g(X Q G B) = [ Ue(dx) YxBg# 

J B 

where H : 9 x X x <j(Y) -> [0, 1], $ : 6 x X x Y -> X and for every 9 £ 9, U e £ V(X) (the probabilities on X). 
Throughout, we assume that for any (x, 9) £ X x O H (x, ■) admits a density w.r.t. some a— finite measure /i, which 
we denote as h e (x,y). Next, we define a prior probability distribution 3 on (O, #(©)), with Lebesgue density £. 
Thus, given n observations yi-.n := (j/i, . . . , y n ) the object of inference is the posterior distribution on O x X: 

Il(d(9,x )\y Un ) cx ( f[h e (^ e (y .. k - 1 )(x ),y k ) \lL g {dx o )£(0)ffl (1) 



where we have used the notation $ e (yo-.k-i)(xo) = Q e o ■ ■ ■ o & e (xo,yi) and dO is Lebesgue measure. In most 
applications of practical interest, one cannot compute the posterior point-wise and has to resort to numerical 
methods, such as MCMC, to draw inference on 8 and/or xq. 

In this article, we are not only interested in inferring the posterior distribution, but the scenario for which 
h (x,y) cannot be evaluated point- wise, nor do we have access to an unbiased estimate of it (it is assumed we can 
simulate from the associated distribution). In such a case, it is not possible to draw inference from the true posterior, 
even using numerical techniques. The common response in Bayesian statistics, is now to adopt an approximation 
of the posterior using the notion of approximate Bayesian computation (ABC); see [13] for a recent overview. ABC 
approximations of posteriors are based upon defining a probability distribution on an extended state-space, with 
the additional random variables lying on the data-space and usually distributed according the true likelihood. The 
closeness of the ABC posterior distribution is controlled by a tolerance parameter e > and often the approximation 
is exact as e — > 0. 

In this paper, we introduce a new ABC approximation of observation driven time-series models, which is closely 
associated to that developed in [8] for hidden Markov models (HMMs) and later for static parameter inference 
from HMMs [5|. This latter ABC approximation is particularly well behaved and a noisy variant (which pertrubs 
the data; see e.g. [5]) is shown under some assumptions to provide maximum-likelihood estimators (MLE) which 
asympotically in n are the true parameters. The new ABC approximation that we develop is studied from a 
theoretical perspective. Relying on the recent work of [Hj we show that, under some conditions, as n — > oo, with n 
the length of the time series, the ABC posterior has, almost surely, a MAP estimator of 8 which is different from the 
true parameter 8* say. However, a noisy ABC MAP of 8 asymptotically converges to the true parameter, almost 
surely. These results establish that the particular approximation adopted is reasonably sensible. 

The other main contribution of this article is a development of a new MCMC algorithm designed to sample 
from the ABC approximation of the posterior. Due to the nature of the ABC approximation it is easily seen that 
standard MCMC algorithms (e.g. 12J) will have an acceptance probability that will fall at an exponential rate in n. 
In addition, more advanced ideas such as those based upon the 'pseudo marginal' [3] , have recently been shown to 
perform rather poorly in theory; see . These latter algorithms are based upon exact approximations of marginal 
algorithms [TJH], which in our context is just sampling 8,x . We develop an MCMC kernel, related to recent work 
in [10] , which is designed to have a random running time per-iteration, with the idea of improving the exploration 
ability of the Markov chain. We show that the expected cost per iteration of the algorithm, under some assumptions 
and for reasonable performance, is 0(n 2 ), which compares favourably with competing algorithms. We also show, 
empirically, that this new MCMC method out-performs standard pseudo marginal algorithms. 

This paper is structured as follows. In Section [2] we introduce our ABC approximation and give our theoretical 
results on the MAP estimator. In Section [3] we give our new MCMC algorithm, along with some theoretical 
discussion about its computational cost and stability. In Section [4] our approximation and MCMC algorithm is 
illustrated on toy and real examples. In Section [5] we conclude the article with some discussion of future work. The 
proofs of our theoretical results are given in the appendix. 



2 Approximate posteria using ABC approximations 

2.1 ABC approximations and noisy ABC 

As it was emphasised in Section[l] we are interested in performing inference when h?(x, y) cannot be evaluated point- 
wise, nor do we have access to an unbiased estimate of it. We will instead assume it is possible to sample from h . 
In such scenaria, one cannot use standard simulation based methods. For example, in a standard MCMC approach 
the Metropolis-Hastings acceptance ratio cannot be evaluated, even though it may be well-defined. Following the 
work in [5, 8 for hidden Markov models, wc introduce an ABC approximation for the density of the posterior in 
([!]) as follows: 

n 

x \y 1:n ) cx J] h d * (4> e (yo-.k-!) (x ),y k ) tfso, 8), (2) 
k=l 

with e > and 

, Bf , <>, w \ , Sb (y k ) he (® e (yo:k-i)(xo),y)n(dy) 

h e ^ e (y 0:k _ l )(x ),y k ) = -^^ , (3) 

where we denote B e (y) as the open ball centred at y with radius e and write [i{B e {y)) — J B fj,(dx). When /i is 
the Lebesgue measure, n(B e (y)) corresponds to the volume of the ball B e (y). 
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In general we will refer to ABC as the procedure of performing inference for the posterior in ([2|. In addition, 
we will call noisy ABC the inference procedure that uses instead of the original observation sequence a perturbed 

one, namely \ Y k r ; where each Y k is given by 
I J k>0 

Y k =Y k + eZ k , 

with each Z k is identically independently distributed (i.i.d.) uniformly on B e (0) (shorthand Z k ~ Ubjq))- 
2.2 Consistency results for the MAP estimator 

In this section we will investigate some interesting properties of the ABC posterior in In particular, we will look 
at the asymptotic behaviour with n of the resulting MAP estimators for 8. The properties of the MAP estimator 
reveal information about the mode of the posterior distribution as we obtain increasingly more data. To simplify 
the analysis in this section we will assume that: 

(Al) — xq is fixed and known, i.e. Tl(dxo) — S x (dxo), where S x denotes the Dirac delta measure on X and ieX 
is known. 

— £(x, •) is bounded and positive everywhere in 0. 

— the observations actually originated from the true model model for some 8* G 8, i.e. we look at a 
well-specified problem. 

— H and h do not depend upon 8. Thus we have the following model recursions for the true model: 

F e *(Y k+1 eA\& k ) = [ H{x k ,dy), 4xXe#, 

J A 

X k+1 = $ e * (X k ,Y k+1 ), (4) 
where we will denote associated expectations to Pg» as Eg* . 

In addition, for this section we will introduce some extra notations: (X, d) is a compact, complete and separable 
metric space and (0,d) is a compact metric space, with C M. ds . For two measures v and A of bounded variation 
denote the convolution v*\ (/) = J f(z + r)v(dz)\(dr). Let also Q e be the probability law associated to the random 
sequence {Z k } ke % , where each Z k is an i.i.d. sample from the uniform distribution defined on B e (0). 

We proceed with some additional technical assumptions: 

(A2) {X k , Y k } ke z is a stationary stochastic process, with {Y k } ke 2, strict sense stationary and ergodic, following Q. 

(A3) For every (x, y) G X x Y, 8 i— > $ e (x,y) is continuous. In addition, there exist < C < oo such that for any 
(x, x') G X, sup yeY \H X > v) ~ h(x' , y)\ < Cd(x, x'). Finally < h < h(x, y) < h < oo, for every (x, y) G X x Y. 

(A4) There exist a measurable g : Y — > (0,1), such that for every (8, y, x, x') G x Y x X 2 

d($ 9 (x,y),$ (x',y)) < e (y)d(x'x'). 

(A5) The following statements hold: 

1. H(x, •) = H(x' , •) if and only if x — x' 

2. If $ 9 (yi oo:0 )(a;) = $ e '(y_ oo:0 )(a;) holds P e » *Q 6 -a.s., then 8 = 8' . 

Assumptions (A[2][5| and the compactness of are standard assumptions for maximum likelihood estimation (ML) 
and they can be used to show the uniqueness of the maximum likelihood estimator (MLE); see [6] for more details. 
Therefore, if the prior £(#) is bounded and positive everywhere on it is a simple corollary that the MAP estimator 
will correspond to the MLE. In the remaining part of this section we will adapt the analysis in [B] for MLE to the 
ABC setup. 

In particular, we are to estimate 8 using the log-likelihood function: 

1 " 

leAw.n) ■= -J2 lo % (^($ e (2/o:fc-i)(x),y/c)) 

k=l 

We define the ABC-MLE for an rt-long sequence as 

8 n ,x,e = a,Tgma,x eee le, x (yi:n)- 
We proceed with the following proposition: 



3 



Proposition 2.1. Assume (J^T^. Then for every x € X and fixed e > 

lim d(6 n>x , e , 0*) = P e * -a.s. 

71— >-00 

w/iere 9 e = argmax eee E e .[log(/i e ($ (y_ oo: o)(a;), Yi))]. 

The result establishes that the estimate will converge to a point, which is typically different to the true parameter. 
Hence there is an intrinsic asymptotic bias for the plain ABC procedure. To correct this bias, consider the noisy 

ABC procedure, of replacing the observations by Yfe = Yk + tZk, where Zk '~ ' Ub 1 (q)- The noisy ABC MLE 
estimator is then: 



1 

argmax eee - ^ log (V($ e (y 0:fc „i)(x), Vh, 

k ' 



We have the following result: 
Proposition 2.2. Assume 

(^HP- Then for every x G X and fixed e > 
lim d(0 W) 0*) =0 P e ,*Q e -a., 



n— J-oo 



The result shows that the noisy ABC MLE estimator is asymptotically unbiased. Therefore, given that in our 
setup the ABC MAP estimator corresponds to the ABC MLE we can conclude that the mode of the posterior 
distribution as we obtain increasingly more data is converging towards the true parameter. Finally we note that 
our assumptions indeed pose some restrictions, but these are shown to be realistic for a few interesting models in 
[6 . In addition, the main purpose of this result is to motivate the use of the approximate posterior in ^ when the 
observation sequence is long or its marginal likelihood is quite informative. 

3 Computational Methodology 

Recall that we formulated in the ABC posterior written in ([5]). One can rewrite the approximate posterior in Q: 



TT e {9,X \yi:n) = f e / w m , Aa > 

J Pe. x Ayi-.n)K{X(i,e)dxQdO 



with 



n Tit TT / \ 

Pl,xo(W:«) = J II ^'(B* (0)) Vfc-Ofo), U k )du 1:n . 

Note we have just used Fubini's theorem to rewritte the likehood p\ x (yi :n ) as an integral of a product instead of 
a product of integrals Ylje=i h e ' e {f& 6 (yo-.k-i) (xn),Vk) shown in ([2])-([3]). In this paper we will focus only on MCMC 
algorithms and in particular on the Metropolis-Hastings (M-H) approach. In order to sample from the posterior w e 
one runs an ergodic Markov Chain with the invariant density being ir e . Then after a few iterations when the chain 
has reached stationarity, one can treat the samples from the chain as approximate samples from 7r c . This is shown 
in Algorithm [lj where for convenience we denote 7 = (9,x ). The one-step transition kernel of the MCMC chain is 
usually described as the M-H kernel and follows from Step 2 in Algorithm [T] 

Unfortunately p@ x (yi :n ) is not available analytically and cannot be evaluated, so this rules out the possibility of 
using of traditional MCMC approaches like Algorithm [TJ However, one can resort to the so called pseudo-marginal 
approach whereby unbiased estimates of Pg x (yi-.n) are used instead within an MCMC algorithm. We will refer 
to this algorithm as ABC-MCMC. The resulting algorithm can be posed as one targeting a posterior defined on 
an extended state space, so that its marginal coincides with 7r e (6', x \yi :n ). We will use these ideas to present 
ABC-MCMC as a M-H algorithm which is an exact approximation to an appropriate marginal algorithm. 

To illustrate an example of these ideas, we proceed by writing a posterior on an extended state-space 8 x X x Y" 
as follows: 

n 

<(6»,x ,ui : „|yi : „) oc JJ l Br{yk) {u k )h 6 ($ e (y -.k-i) (x ),Vk) (,(x ,0). (5) 

k=l 

It is clear that ^ is the marginal of |5]) and hence the similarity in the notation. As we will show later in this 
section, extending the target space in the posterior as in |5]) is not the only and certainly not the best choice. 
We emphasise that the only essential requirement for each choice is that the marginal of the extended target is 
7r e (#, Xo\yi :n ), but one should be cautious because the particular choice will affect the mixing properties and the 
efficiency of the MCMC scheme that will be used to sample from 7r^(6*, x , Ui :n \yi :n ) in ^ or another variant. 
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Algorithm 1 A marginal M-H algorithm for 7r c (7|y 1: „) 



1. (Initialisation) At t = sample 70 ~ £. 

2. (M-H kernel) For t > 1: 

• Sample 7'|7t_i from a proposal Q(jt-i, ■) with density q^t-x, ■)• 

• Accept the proposed state and set 74 = 7' with probability 

Py(yi:J ^ e(Y)9(7',7t-i) 



p^.^yim) ^(74-1)9(74-1, 7')' 

otherwise set 74 = 7t-i- Set t = t + 1 and return to the start of 2. 



Algorithm 2 M-H Proposal for basic ABC MCMC 



• Sample 7^7 from a proposal Q(~f, •) with density (7(7, •). 

• Sample u x . n from a distribution with joint density Ofc=i h 6 ($ 9 {yo-.k— i)(po)i u k) 

• Accept the proposed state (7 , ,u' 1 . n ) with probability: 



1A nL^B.fa^K) x g(Y)g(Y,7) 

nLi^fo^K) ^(7)9(7,7')' 



3.1 Standard approaches for ABC-MCMC 

We will now look at two basic different choices for extending the ABC posterior while keeping the marginal fixed 
to 7r e (0, Xo\yi :n ). In the remainder of the paper we will denote 7 = (0,xq) as we did in Algorithm]!] 
Initially consider the ABC approximation when be extended to the space 6 x X x Y" : 

cl \ el \ TT" I B g («fc)(" fc ) n 
£ / 1 s KKDV^VV.n) llfc=l M(-Be(0)) TT ,9i^Bl \l \ \ 

77 (7 ' Ul: " lyi: ' l) = Wfai:n)*r ^ n * (* (^-i)(*oW). 



fc=i 



Recall one cannot evaluate h 9 (<& 6 {xo){uk),Uk) and is only able to simulate from it. In Algorithm |2]) we present 
a natural M-H proposal that could be used to sample from 7T e (7, Uun\yi:n) instead of the one shown Step 2 at 
Algorithm]!] Note that this time the state of the MCMC chain is composed of (7, u\- n ). Here each Uk assumes the 
role of an auxiliary variable to be eventually integrated out at the end of the MCMC procedure. 

However, as n increases, the M-H kernel in Algorithm [2] will have an acceptance probability that falls quickly 
with n. In particular, for any fixed 7, the probability of obtaining such a sample will fall at an exponential rate in 
n. This means that this basic ABC MCMC approach will be inefficient for a moderate value of n. 

This issue can be dealt with by using N multiple trials, so that at each k, some auxiliary variables (or pseudo- 
observations) are in the ball B c {yk). This idea originates from [31 [12] and in fact augments the posterior to a larger 
state-space, 6 x X x Y nN , in order to target the following density: 

/ \ 6 / \ TT™ Sjli I B e (y fc )("fc) n N 
UN I \ 7r (7)P 7 (?/l:«) llfe=l JV M (B e (0)) 



fe=ij=i 



Again, it is easy to show that the marginal of interest 7r e (7|yi :n ) is preserved, i.e 



7T e (7|yi:n) = / ^{l, "lm \Vl:n)du\:^ = \ 7T C (7, Uun \Vl:n)dUu. n - 

In Algorithmplwe present an M-H kernel with invariant density 7r e . The state of the MCMC chain now is (7, uj:^) . 
We remark that as N grows, one expects to recover the properties of the ideal M-H algorithm in Algorithm [l] 
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Algorithm 3 M-H Proposal for ABC with N trials 



• Sample 7^7 from a proposal Q(7, •) with density (7(7, •). 

• Sample u' from a distribution with joint density rifc=i IljLi ($ 9 (7/o:fc-i)(£o), u 't) ■ 

• Accept the proposed state ^7', v! with probability: 

nLi(FEf = iiB gfafc )(u'i)) vr(y) g (Y,7) 

nL^Ef^i^K)) -(7)9(7,7')' 



Nevertheless, it has been shown in [TT] that even the M-H kernel in Algorithm [3] does not always perform well. It 
can happen that the chain gets often stuck in regions of the state-space O x X where 



afe(j/i:fe,e,7) : = / h {yo-.k-i)(x ),u)du 

JB e (y k ) 



is small. 



3.2 A Metropolis-Hastings kernel for ABC with a random number of trials 

We will address this shortfall detailed above, by proposing an alternative augmented target and corresponding M-H 
kernel. The basic idea is that a random number of trials is used based on the value of ak(yv.k> £■> 7)- Then it will be 
possible to use more computational effort when the chain is at regions where otkiyi-.k, e, 7) is low. 
Consider an alternative extended target, for N > 2, e Mn := {N, N + 1, . . . , }, 1 < k < n: 



^(l)P 7 (yi:n) Uk=l M (B e (0))(m fc -l) TT ( m k - 1\ , x 

J 7!"(7)P7(Z/l : n)d7 Z^U/lin) fc = i V ^ " 1 / 

Standard results for negative binomial distrubutions (see [lH [15] for more details) imply that 



JV-1 

m k -N 



(G) 



m fc =AT 

holds and this can be used to deduce that 

TT 

ii M(S,(0))(m t - 1) 



is an unbiased estimator for p^(yi:n). In addition, from |6]) it follows that the marginal w.r.t. 7 is the one of interest: 

^(llVl-.n) = ^ £ (7,™l:n|yi:n) 

In Algorithm [5] we present a M-H kernel with invariant density n e . The state of the MCMC chain this time is 

(7, m l:n)- 

The potential benefit of this kernel is that one expects the probability of accepting a proposal is higher than the 
previous M-H kernel (for a given N) . This comes at a computational cost which is both increased and random. The 
proposed kernel is based on the N—hit kernel of [10] . which has been adapted here to account for the data being a 
sequence of observations resulting from a time series. Finally, it is important to mention that in Algorithms [3] and 
[4] generating multiple trials can be implemented very efficiently in parallel using appropriate computing hardware, 
such as multi-core processors or computing clusters. 
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Algorithm 4 M-H Proposal with a random number of trials 



Sample 7^7 from a proposal Q(7, ■) with density (7(7, •). 



• For fc = 1, . . . , n repeat the following: sample u\, u|, . . . with probability density h e (& e (yo-.k-i)(x' ),Uk) until 
there are N samples lying in B e (yk); the number of samples to achieve this (including the successful trial) is 



id 



k ' 



Accept ('Y,rn' 1 . n ) with probability: 



rjtl 1 

1 A jjj^ K^T x 7r(7 , )g(7 , , 7 ) 
nLiS^T ^(7)9(7,7')' 



3.2.1 On the choice of A 



To implement the proposed kernel, one needs to select N. In practice to it is difficult to know a good value this a 
priori, so we present a theoretical result that can add some intuition on choosing N. Let E~ denote expectation 
w - r -t' ]lfe=i ("N~i) a k(yi:k,£, 7)^(1 - a k (yi:k,e,l)) mk ~ N given 7,7V. We will also pose the assumption: 

(A6) For any fixed e > 0, 7 G 6 x X, we have ctk(yi-.k, e, 7) > 0. 

The the following result holds, whose proof can be found in the appendix.: 

Proposition 3.1. Assume (Afy and let (3 G (0, 1), n > 1 andN>^V 3. Then for fixed (7, e) G 8 x X 
have 



we 



E 



■y,N 



n 



1 

k=l A/fc-1 



n 



n ak(yi:k,e,l) 



fe=l 



N-l 



< 



Cn 



where C = 



The result shows that one should set N = O(n) for the relative variance not to grow with n, which is unsuprising, 
given the conditional independence structure of the m 1: „. To get a better handle on the variance, suppose n = 1, 
then for 7 fixed 

"1(2/1, 7) (1 - "1(2/1, e> 7)) 



A' 



r ar 7iA r[^^I Bi!(!/l) ( U {) 



i=i 



TV 



For the new approach one can show 



V 



AT- 1 
Mi - 1 



< 



Qi(2/i> £ >7) 2 
(/V -2) 



Not taking into account the computational cost, one prefers this new estimate with regards to variance if 



A^-2 



< 



I - Qi(yi,e,Q 
(j/i,e,7) 



which is likely to occur if 0:1(2/1,6,7) is not too large (recall we want e to be small, so that we have a good 
approximation of the true posterior) and A^ is moderate - this is precisely the scenario in practice. 



Remark 3.1. It is easily shown that the relative variance associated to the estimate YYk=i 
is 



= l R B e (y k ) 



K) 



n[ 



1 



N - 1 



1. 



Note this quantity is not uniformly upper-bounded in 7 unless inffe j7 otk{yi-.k, 7) > C > ; which may not occur. 
Conversely, Proposition \3.1\ shows that the relative variance of the new estimator is uniformly upper-bounded in 7 
under minimal conditions. We suspect that this means in practice that the kernel with random number of trials may 
mix faster . 
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3.2.2 Computational considerations 

As the cost per-iteration is random, we will investigate this further. We denote the proposal of 7, mi :n as Q. Let £ 
be the initial distribution of the MCMC chain and (^K* the distribution of the state at time t. In addition, denote 
by m\ the proposed state for at iteration t. Finally, we will write as E^t^g the expectation of a random 

variable proposed by Q given the simulated state at time t. We will assume that the observations are fixed and 
known. Then we have the following result: 

Proposition 3.2. Let e > 0, and suppose that there exists a constant C > such that for any n > 1 we have 
inffc ak(yi:k, 7, e ) > C, fi— a.e.. Then it holds for any N > 2, t > 1, that: 

fc=l 

The expected computational cost grows linearly with n. Thus, coupled with the result in Proposition |3.1| one 
has a cost of 0(n 2 ) per-iteration, which is comparable to many exact approximations of MCMC algorithms (e.g. pQ), 
albeit in a much simpler situation. Note also that the kernel in Algorithm [3] is expected to require a cost of 0(n 2 ) 
per iteration for reasonable performance, although this cost here is deterministic. As mentioned above, one expects 
the approach with random number of trials to work better with regards to the mixing time, especially when the 
values of ojfc(j/i : fc, e, 7) are not large. We attribute this to Algorithm [4] providing a more 'targetted' way to use the 
simulated auxiliary variables. This will be illustrated numerically in Section [3] 



3.2.3 Relating the variance of the estimator or pt{yx-. n ) with the efficiency of ABC-MCMC 

A comparison of our results with the interesting work in [7] seems relevant. There the authors deal with a more 
general context and show that we should choose A as a particular asymptotic (in N) variance; the main point is 
that the (asymptotic) variance of the estimate of p~(j/i :n ) should be the same for each 7. We conjecture that in our 
set-up one should choose N such that the actual variance of the estimate of pt(yi :n ) is constant with respect to 7. 
In this scenario, on inspection of the proof of Proposition |3.1[ for a given 7, one should set N to be the solution of 



n 



1 



k=l 



(N - l) n (N - 2)" (A-l) 2 ™ 



= C 



for some desired (upper-bound on the) variance C (whose optimal value would need to be obtained). This makes 
N a random variable, in addition, but does not change the simulation mechanism. Unfortunately, one cannot do 



this in practice, as the ak{yi:ki e tl) are unknown. Taking into account Remark 3.1 this latter approach may not 
be so much of a concern in practice. 

3.2.4 On the ergodicity of the sampler 

We conclude this discussion by adding a related comment regarding the ergodicity of the proposed MCMC kernel. 
If there exists a constant C < 00 such that 

< C 7f e (d7|yi:„) - a.e. 



and the marginal MCMC kernel in Algorithm [l] is geometrically ergodic, then by [21 Propositions 7, 9] the MCMC 
kernel of Algorithm |4j is also geometrically ergodic. 

4 Examples 

4.1 Scalar normal means model 
4.1.1 Model 

For this example let each Y/., Xk, be a scalar real random variable and consider the model: 

Yk+i = 6Xk + Kk, Xk+i = Xk 
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with Xq = 1 and n k ' A/"(0,<7 2 ), where we denote Af(0,a 2 ) the zero mean normal distribution with variance a 2 . 
The prior on 9 is A/"(0, <j)). This model is usually referred to as the standard normal means model in one dimension 
and the posterior is given by: 

2 n 
k=l 

where a 2 — + . Note that if Y k Af(6*,a 2 ), then the posterior on 6> is consistent and concentrates 

around 8* as n — > oo. 

The ABC approximation after marginalizing out the auxiliary variables has a likelihood given by: 

« fc )4iiK^)- F (^)] 

fe=i 

where i* 1 is the standard normal cumulative density function. Thus, this is a scenario where we can perform the 
marginal MCMC. 



4.1.2 Simulation Results 

Three data sets are generated from the model with n £ {10,100,1000} and a 2 = 1. In addition, for e = 1 wc 
perturb the data-sets in order to use them for noisy ABC. For the sake of comparison, we also generate a noisy 
ABC data-set for e = 100. We will also use a prior with <j> — l. 

We run the new MCMC kernel (the proposal in Algorithm [4]- we will frequently use the expression 'Algorithm' 
to mean an MCMC kernel with the given proposal mechansim of the Algorithm), old MCMC kernel (Algorithm 
[3]) and a Marginal MCMC algorithm which just samples on the parameter space R (i.e. the posterior density is 
proportional to Pg{yi:n)^(P))- Each algorithm is run with a normal random walk proposal on the parameter space, 
with the same scaling. The scaling chosen yields an acceptance rate of around 0.25 for each run of the marginal 
MCMC algorithm. The new MCMC kernel is run with N = n and the old with a slightly higher value of N so that 
the computational times are about the same (so for example, the running time of the new kernel is not a problem 
in this example). The algorithms are run for 10000 iterations and the results can be found in Figures [Ij3] 

In Figure[l]the density plots for the posterior samples on 9, from the marginal MCMC can be seen for e 6 {1, 100} 
and each value of n. When e = 1, we can observe that both ABC and noisy ABC both get closer to the true posterior 
as n grows. For noisy ABC, this is the behavior that is predicted in Section [2~2| For the ABC approximation, 
following the proof of Theorem 1 in [5], one can see that the bias falls with e; hence, in this scenario there is not a 
substantial bias for the standard ABC approximation. When we make e much larger a more pronounced difference 
between ABC and noisy ABC can be seen and it appears as n grows that the noisy ABC approximation is slightly 
more accurate (relative to ABC). 

We now consider the similarity of the new and old MCMC kernels to the marginal algorithm (i.e. the kernel 
both procedures attempt to approximate), the results are in Figures [2j3j With regards to both the density plots 



(Figure [2| and auto-correlations (Figure [3| we can see that both MCMC kernels appear to be quite similar to the 
marginal MCMC. It is also noted that the acceptance rates of these latter kernels are also not far from that of the 
marginal algorithm (results not shown). These results are unsuprising, given the simplicity of the density that we 
target, but still reassuring; a more comprehensive comparison is given in the next example. Encouragingly, the new 
and old MCMC kernels do not seem to noticably worsen as n grows; this shows that, at least for this example, the 
recommendation of N = 0(n) is quite useful. We remark that whilst these results are for a single batch of data, 
the results are consistent with other data sets. 

4.2 Real Data Example 
4.2.1 Model 

Set, for (Y k ,X k ) eMxl+ 

Y k+1 = n k k e N 
X k+1 = /?o + f3iX k + A>Yfc 2 +1 fee N 

where K k \x k ~ 5(0, x k , <fi, V2) (i-e. a stable distribution, with location 0, scale X k and asymmetry and skewness 
parameters ipi,(p 2 ). We set 

X ~Qa{a,b), PotPufc ~ Ga(c,d) 
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where Ga(a, b) is a Gamma distribution with mean a/b and 9 = (/?o:2) € (R + ) 3 . This is a GARCH(1,1) model with 
an intractable likelihood. 

4.2.2 Simulation Results 

We consider daily log-returns data from the S&P 500 index from 03/1/11 to 14/02/13, which constitutes 533 data- 
points. In the priors, we set a = c = 2 and b = d = 1/8, which are not overly informative. In addition, tpi — 1.5 and 
if2 = 0. We consider e g {0.01,0.5} and only a noisy ABC approximation of the model. Algorithms [3] and [I] are to 
be compared. The MCMC proposals on the parameters are random-walks on the log-scale and for both algorithms 
we set N = 250. It should be noted that our results are fairly robust to changes in N € [100,500], which are the 
values we tested the algorithm with. 

In Figure[4]we present the trace-plot of 50000 iterations of both MCMC kernels when e = 0.5. Algorithm [3] took 
about 0.30 seconds per iteration and Algorithm [4] took about 1.12 seconds per iteration We modified the proposal 
variances to yield an acceptance rate around 0.3. The plot shows that both algorithms appear to move across the 
state-space in a very reasonable way. The new algorithm takes much longer and in this situation does not appear 
to be required. This run is one of many we performed and we observed this behaviour in many of our runs. 

In Figure [5] we can observe the trace plots from a particular (typical) run when e = 0.01. In this case, both 
algorithms are run for 200000 iterations. Algorithm [3] took about 0.28 seconds per iteration and Algorithm [4] took 
about 2.06 seconds per iteration; this issue is discussed below. In this scenario, considerable effort was expended 
for Algorithm [3] to yield an acceptance rate around 0.3, but despite this, we were unable to make the algorithm 
traverse the state-space. In contrast, with less effort, Algorithm [I] appears to perform quite well and move around 
the parameter space (the acceptance rate was around 0.15 versus 0.01 for Algorithm [3]) . Whilst the computational 
time for Algorithm [4] is considerably more than Algorithm [3j in the same amount of computation time, it still 
moves more around the state space; algorithm runs of the same length are provided for presentational purposes. 
We remark that, whilst we do not claim that it is 'impossible' to make Algorithm [3] mix well in this example, we 
were unable to do so and, alternatively, for Algorithm [4] we expended considerably less effort for very reasonable 
performance. This example is typical of many runs of the algorithm and examples we have investigated and is 
consistent with the discussion in Section [3.2.2| where we stated that Algorithm [4] is likely to out-perform Algorithm 
[3] when the otk{yi:k, e, 7) are not large, which is exactly the scenario in this example. 

Turning to the cost of simulating Algorithm |4j for the case e = 0.5 we simulated the data an average of 148000 
times (per-iteration) and for e = 0.01 this figure was 330000. In this example signifcant effort is expended in 
simulating the mi :n . This shows, at least in this example, that one can run the algorithm without it failing to 
sample the m\ :n . The results here suggest that one should prefer Algorithm [4] only in challenging scenarios, as it 
can be very expensive in practice. 

Finally, we remark that the MLE for a Gaussian Garch model, is /?o : 2 = (4.1 x 10 -6 , 0.16, 0.82). This differs to 
the posterior means, which may indicate that a stable distribution could be useful for modelling the observations 
for this class of models. 

5 Conclusions 

In this article we have considered approximate Bayesian inference from observation driven time series models. We 
looked at some consistency properties of the corresponding MAP estimators and also proposed an efficient ABC- 
MCMC algorithm to sample from these approximate posteriors. The performance of the latter was illustrated using 
numerical examples. 

There are several interesting extensions to this work: 

• the asymptotic analysis of the ABC posterior in Section |2.2| can be further extended. For example, one may 
consider Bayesian consistency or Bernstein Von-Mises theorems, which could provide further justification to 
the approximation that was introduced here. Alternatively, one could look at the the asymptotic bias of the 
ABC posterior w.r.t. e or the asymptotic loss in efficiency of the noisy ABC posterior w.r.t. e similar to the 
work in [5] for hidden Markov models. 

• the geometric ergodicity of the presented MCMC sampler can be further investigated in the spirit of [21 Qj] • 

• an investigation to extend the ideas here for sequential Monte Carlo methods should be beneficial. This has 
been initiated in [3] in the context of particle filtering for a different class of models. 
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A Proofs for Section 



Proof. [Proof of Proposition |2.1| The proof of lim„_j. 00 A{9 n x e , 0*) = Pg* — a.s. follows from [5J Theorem 21] if 
we can establish conditions (Bl-3) for our perturbed ABC model. Clearly (Bl) and part of (B2) holds. (B3-i) hold 
via Lemma 21] via (A[4|. We first need to show that for any y G Y that x H> h e (x,y) is continuous. Consider 

\h e (x,y) - h e (x 7 y)\ = fJ }, n ^ \ [ (h(x,y) - h(x' ,y)) fi(dy)\. 

/Z(-tf e (U)J J Be (y) 

Let e > 0, then, by (J^ there exists a 6 > such that for d(x, x') < 5 

sup \h(x,y) - h(x',y)\ < e 

and hence for (x,x') as above 

\h<(x,y)-h<tf,y)\<e. 

which establishes (B2) of [6]. Now, for (B3-ii) of [6], we note that as h < h e (x,y) < h < oo (see (A|3j)) the log 
function is Lipshitz and 

|io g (/ l e ($ e (y 1;fe _ 1 )( a; ) ) y fe ))-iog(^($ e (y_ oo:fc _ 1 )(a ; ) ! r fe ))| < c|fc e ($ fl (y 1:fc _i)(a;),n) -/i($ e (y_oo:fe-i)(a;),y fe )| 

for some C < oo that does not depend upon Y—oo^—i, Y k , x, e. Now 

|/i e ($ e (y 1:fc _ 1 )(a ; ) i y fe )-/i($ e (F_ 00:fe _ 1 )(x),y fe )| = (/.(^(O)))- 1 ! / [h^%Y 1:k ^)(x),y)-h^%Y^.. k _ 1 )(x),yMdy)\ 

JB t (Y k ) 

and 

(/^(yx^XaO^)-^*^ 

B t (Y k ) yGY 

Thus, by (A|I} and the fact that (B3-i) of [5] holds: 

iimsu P |iog(ft e ($ e (y 1:fc _ 1 )(a ; ),y £ ))-iog(/i e ($ e (y_ 00:fc _ 1 )(x),y £ ))| = p 9 .-o. s . 

Note, finally that (B3-iii) trivially follows by ft 6 (a;, y) < h < oo. Hence we have proved that 

lim d(9„, !£ ,e;) = P 9 *-a.s.. 

n— voo 

□ 



Proof. [Proof of Proposition 2.2 This result follows from [5J Proposition 23]. One can establish assumptions (Bl-3) 



of [B] using the proof of Proposition 2.1 Thus we need only prove that 

H e (x, •) = H e (x',-), &x = x'. 

Now, for any A G y 

He ^ A ) = TJWm\ 1 1 / H(x,du)Mdy). 
B y S B<l [y) H & du ) = S Bc (y) Hix'.du) ^x = x' 7 so 

H e (x, A) = 1 / [ / H(x', du)]ii{dy) «. x = x' 
fJ,{B e (U)) J A J Bc (y) 

which completes the proof. □ 
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B Proof for Section 



Proof. [Proof of Proposition 3.1 We have 

llfc=l M fc -1 



E 



7, AT 



k=l 



1 



]V-1 



U.lfe=l 7V-1 J 



I1 E 



7, AT 



fc=l 



(Af fc - l) 2 



n 

k=l 



«fc(yi : fc,e,7) 
AT - 1 



Now, by [H [lg (AT > 3) for any fc > 1 
and thus clearly 



1 



E 



«fc(?/l:fc,£,7) 2 

(M fc -l)(M fc -2)J (AT-l)(iV-2) 
afc(2/i : fe,e,7) 2 



7,Af 



< 



(M fc ~l) 2 J " (AT-l)(AT-2)' 



hence 



E 



n n 1 \ 2 



\ TT™ "fc(l/i:fc,e,7) 
L Mlfc=l 



2V-1 



< 



(N -lf n { 



(A r -l)"(A r -2)« (Af-1) 2 



Now the R.H.S.of Q is equal to 



Now, we will show 



N n - 2nN n ~ 1 + X;™ =2 (")A r " _i (-2) i 

E(")^[(-ir-(-2r]<o. 



(7) 



(8) 



(9) 



The proof is given when n is odd. The case n even follows by the proof as n — 1 is odd and the additional term is 
negative. Now we have for fee {1, 3, . . . , (n — l)/2} that the sum of consecutive even and odd terms is equal to 



N n-2k n \ 



(n-2fe- l)!(2fe)! 



N{\ - 2 2fc )(2fc + 1) - (2 2fc+1 - l)(n - 2k) 
(n-2fe)(2fe + l)AT 



which is negative as 



N > 



(2 2fc+1 - l)(n-2fc) 
(l-2 2fc )(2fc + l) ' 
Thus we have established We will now show that 



( n )N n -\-2) 1 > 0. 



(10) 



Following the same approach as above (i.e. n is odd) the sum of consecutive even and odd terms is equal to 



(n-2k- l)!(2fc)! 



N(2k + 1) - 2(n - 2k) 
[n ~ 2k)(2k + 1)N 



This is positive if 



n — 2k n 

N > < -. 

~ 2fc + 1 ~ 3 



as N > j^pj and 6 > (1 - 0) it follows that N > n/3 >(n- 2k) /(2k + 1); thus one can establish ([10 1. 
Now returning to Q and noting ([s]), Q and ( fToj ) , we have 



E 



■j,N 



n n 1 x 2 

M fc -1 _ i 

nn t»fc(ai:fc,e,7) 
fc=l iV-1 



< 



nA^"- 1 _ n 
N n - 2nN ~ N — 2n 



as N > 2/(1 - 0) it follows that n/(N - 2n) < Cn/N and we conclude. 



□ 
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Proof. [Proof of Proposition 3.2 We have 
IS 



n i. n / i \ 

E M *i = / E(E m ^{nnv-ir fc(yi ^ 

fe=i J(exx) 2 M „ fe=1 fe=1 V / 

= / (E^^ — T )?(7,y)c^(*y)*/<^. 

where we have used the expectation of a negative-binomial random variable and applied inf^ ak{yi-.k, 7, e) > C, 
//— a.e. in the inequality □ 
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(a) ra = 10, e = 1 



(b) n = 10, e= 100 




— 2 —1 O 1 2 



(e) n = 1000, e = 1 (f ) n = 1000, e = 100 

Figure 1: Marginal MCMC Density Plots. In each plot the true posterior (black), noisy ABC (orange), ABC (green) 
densities of 9 are plotted, for different values of n (10, 1st row, 100, 2nd row, 1000, 3rd row) and e (1, 1st column, 
100, 2nd column). The black vertical line is the value of^that generated the data. 



(a) n = 10, ABC 



(b) n = 10, Noisy ABC 





(e) n = 1000, ABC (f) n = 1000, Noisy ABC 

Figure 2: MCMC Density Plots. In each plot the true posterior (black), ABC (green, 1st col) or noisy ABC (orange, 
2nd col) densities of 9 are plotted, for different values of n (10, 1st row, 100, 2nd row, 1000, 3rd row). In addition 
the plots are for the new and old (the *) MCMC kerneli"? The black vertical line is the value of 9 that generated 
the data. Throughout e = 1 






(e) n = 1000, Marginal (f) n = 1000, Standard 

Figure 3: Marginal MCMC and Standard Auto-Correlation Plots. In each plot in column 1 the auto-correlation 
for every 10th iteration is plotted for noisy ABC (orange) and ABC (green). In each plot in column 2 the auto- 
correlation over the same period is plotted for both the new and old MCMC kernels, for both noisy ABC (orange 
for new and blue for old) and ABC (green for new, red for old). Three different values of n are presented and e = 1 
throughout. 
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(c) ft (d) I3 2 

Figure 4: Trace, e = 0.5. We run both algorithms for 50000 iterations, the new kernel is in orange trace and 
N = 250 in both cases and the algorithms are run the S & P 500 data. 
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Figure 5: Trace, e = 0.01. We run both algorithms for 200000 iterations, the new kernel is in orange trace and 
N = 250 in both cases and the algorithms are run the S & P 500 data. 



18 



